Question 1: What Makes a Team Win?¶

Assignment: Final Project Deliverable¶

Group Members: Miguel Avila, Genna Moellering, Shaorui Pi¶

Overview¶

In this notebook, we are analyzing the factors that affect the teams' success. To fully answer this question, we explored the 'Elo' dataset and the 'NFL Favorite Teams' dataset. Combining these two data sets allows us to use the author of the 'NFL Favorite Teams' unique ratings for each team across a broad array of categories and measure those ratings against teams' ability to accumulate wins.

Cleaning Data¶

In order to answer this question we need to use the following data sets: nfl_elo (https://github.com/fivethirtyeight/data/tree/master/nfl-elo) and team-picking-categories (https://github.com/fivethirtyeight/data/blob/master/nfl-favorite-team).

The ratings from the 'NFL Favorite Teams' dataset was published in 2016. We assume her ratings were most informed by the 5 years preceeding her publishing date, and therefore restricted our game data to those played from 2010-2015. We filtered the dataset down to those relevant seasons, made the titles more accessible to the audience, and added columns to document the winning team and whether the home team won. There were a few ties in the dataset, three to be exact, so we dropped those values. We then merged the Elo dataset with the 'NFL Favorite Teams' datasetm, dropped the columns we weren't using, made the labeling uniform, and renamed the abbreviations to make the labels more accessible to our audience. We finished by ensuring our datatypes were sensible and added a column to keep track of the total number of wins for each team.

Visualizations¶

(a) We first created a pairplot to see if there is any relationship between varaibles - there wasn't anything of note in this visual.

(b) We created some initial catplots to show how the ratings for Affordability, Behavior, and Coaching measured against total wins. From these graphs, we are able to see that ratings for Coaching ability look slightly more correlated with high win rates.

(c) We decided that a tree map would be the most effective visual to provide an executive overview of our data. We organized it so that it scaled the boxes for each team based on their number of wins compared to other teams, and it uses a rating scale to show how well each team does in each category.

(d) Finally, we decided to use a radar chart to do a one-to-one comparison of the best and the worst teams' ratings across attributes. The radar chart helps us do an effective comparison to highlight differences between the best and worst teams. It also serves an important purpose in giving our audience insight into our methodology. We felt it was important to include the chart for these reasons.

Question Refinement¶

Based on our exploration of the dataset, we refined our subquestions from the first assignment as follows:

(a) Home Field Advantage - How Much Does it Really Matter?

(b) Show Me the Money - Do Teams With Fans that Spend More Win More?

(c) Do the Crime, Do the Time - Does Player Behavior Impact Team Performance?

(d) Leading from the Front - How Important is Coaching to Team Success?

References¶

  • https://stackoverflow.com/questions/18172851/deleting-dataframe-row-in-pandas-based-on-column-value
  • https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
  • https://stackoverflow.com/questions/31460146/plotting-value-counts-in-seaborn-barplot
  • Lecture 3 - Data Wrangling I Slides
  • Lecture 4 - Data Wrangling II Slides
  • Lecture 6 - Categorical Data Slides
  • https://medium.com/@morganjonesartist/color-guide-to-seaborn-palettes-da849406d44f
  • https://towardsdatascience.com/wide-to-long-data-how-and-when-to-use-pandas-melt-stack-and-wide-to-long-7c1e0f462a98
  • https://plotly.com/python/treemaps/
  • https://plotly.com/python/radar-chart/
  • https://www.data-to-viz.com/#treemap
  • Python Data Science Handbook - 3.08 and 3.09
  • https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
  • https://datagy.io/pandas-groupby/
  • https://towardsdatascience.com/ultimate-pandas-guide-mastering-the-groupby-104306251739
  • https://pbpython.com/groupby-agg.html
  • https://stackoverflow.com/questions/68616781/customizing-the-hue-colors-used-in-seaborn-barplot
  • https://colorbrewer2.org/#type=sequential&scheme=Greens&n=3
  • https://www.flaticon.com/free-icons/time-out

Data Cleaning Process¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
In [2]:
elo=pd.read_csv('https://projects.fivethirtyeight.com/nfl-api/nfl_elo.csv', parse_dates=['date'], 
                usecols=['date','team1','team2','score1','score2'])
elo.sample(10)
Out[2]:
date team1 team2 score1 score2
12055 2003-09-28 BAL KC 10.0 17.0
13896 2010-09-19 WSH HOU 27.0 30.0
13553 2008-12-14 MIA SF 14.0 9.0
14491 2012-10-14 ATL OAK 23.0 20.0
11629 2001-11-25 CLE CIN 18.0 0.0
12135 2003-11-09 JAX IND 28.0 23.0
11959 2002-12-21 MIN MIA 20.0 17.0
8110 1987-09-20 NYG DAL 14.0 16.0
16735 2020-12-13 CAR DEN 27.0 32.0
16957 2021-11-14 DEN PHI 13.0 30.0

To simplify our analysis of questions on whether or not factors like coaching ability, fan expenditure, and player behavior affect the number of wins teams accumulate, we decided to use the metrics given by the 'NFL Favorite Team' dataset. The author's analysis was published in 2016. Therefore, we assume her ratings were most informed by the 5 years preceeding her publishing date, and therefore restricted our game data to those played from 2010-2015.

In the following code, we restrict our dataset to games that occurred within the given timeframe, from 2010 to 2015.

In [3]:
opening_day = '2010-09-09'
closing_day = '2015-02-01'
condition1 = elo['date'] >= opening_day
condition2 = elo['date'] <= closing_day
elo = elo[condition1 & condition2]
elo
Out[3]:
date team1 team2 score1 score2
13871 2010-09-09 NO MIN 14.0 9.0
13872 2010-09-12 NYG CAR 31.0 18.0
13873 2010-09-12 PHI GB 20.0 27.0
13874 2010-09-12 JAX DEN 24.0 17.0
13875 2010-09-12 WSH DAL 13.0 7.0
... ... ... ... ... ...
15201 2015-01-11 GB DAL 26.0 21.0
15202 2015-01-11 DEN IND 13.0 24.0
15203 2015-01-18 SEA GB 28.0 22.0
15204 2015-01-18 NE IND 45.0 7.0
15205 2015-02-01 NE SEA 28.0 24.0

1335 rows × 5 columns

We adjust the column names to make them more interpretable to our audience.

In [4]:
elo = elo.rename(columns={'date':'Game_Date', 'team1':'Home_Team', 'team2':'Away_Team', 
                    'score1':'Home_Score', 'score2':'Away_Score'})

We reset the index of our dataframe, which now contains only the desired NFL season.

In [5]:
elo.reset_index(drop=True, inplace=True)

Now that we have the refined dataset, we create a column that shows the winning team. We will use this column later to document total wins.

In [6]:
conditions = [(elo['Home_Score'] > elo['Away_Score']), 
              (elo['Away_Score'] > elo['Home_Score'])]
values = [elo['Home_Team'], elo['Away_Team']]
elo['Winning_Team'] = np.select(conditions, values, default='Tie')
elo.sample(5)
Out[6]:
Game_Date Home_Team Away_Team Home_Score Away_Score Winning_Team
1235 2014-11-23 ATL CLE 24.0 26.0 CLE
158 2010-11-21 NYJ HOU 30.0 27.0 NYJ
896 2013-10-20 PIT BAL 19.0 16.0 PIT
1215 2014-11-13 MIA BUF 22.0 9.0 MIA
583 2012-09-30 TB WSH 22.0 24.0 WSH

There is a chance that the scores were equal and the game resulted in a tie. We see from the following code that there are 3 games in this dataset that ended in ties.

In [7]:
elo.Winning_Team.unique()
Out[7]:
array(['NO', 'NYG', 'GB', 'JAX', 'WSH', 'MIA', 'NE', 'TEN', 'PIT', 'HOU',
       'SEA', 'TB', 'CHI', 'ARI', 'BAL', 'KC', 'ATL', 'IND', 'CIN', 'DEN',
       'PHI', 'OAK', 'NYJ', 'LAC', 'LAR', 'MIN', 'DAL', 'CLE', 'DET',
       'SF', 'CAR', 'BUF', 'Tie'], dtype=object)
In [8]:
elo.groupby('Winning_Team').count().tail()
Out[8]:
Game_Date Home_Team Away_Team Home_Score Away_Score
Winning_Team
SF 55 55 55 55 55
TB 27 27 27 27 27
TEN 30 30 30 30 30
Tie 3 3 3 3 3
WSH 28 28 28 28 28
In [9]:
elo.drop(elo.index[elo['Winning_Team'] == 'Tie'], inplace = True)

The ties have been removed from the dataset.

In [10]:
elo.groupby('Winning_Team').count().tail()
Out[10]:
Game_Date Home_Team Away_Team Home_Score Away_Score
Winning_Team
SEA 57 57 57 57 57
SF 55 55 55 55 55
TB 27 27 27 27 27
TEN 30 30 30 30 30
WSH 28 28 28 28 28

Now we add a column to document whether the home team won or lost the game, which we will use to answer the question of whether home field advantage really matters.

In [11]:
conditions = [(elo['Winning_Team'] == elo['Home_Team'])]
values = [True]
elo['Home_Victory'] = np.select(conditions, values)
elo.sample(5)
Out[11]:
Game_Date Home_Team Away_Team Home_Score Away_Score Winning_Team Home_Victory
953 2013-11-17 PHI WSH 24.0 16.0 PHI 1
326 2011-10-02 LAR WSH 10.0 17.0 WSH 0
1015 2013-12-15 IND HOU 25.0 3.0 IND 1
72 2010-10-10 WSH GB 16.0 13.0 WSH 1
868 2013-10-06 DAL DEN 48.0 51.0 DEN 0

The Elo dataset is now ready for analysis. We can move forward by reading in our second dataset - 'NFL Favorite Teams.'

In [12]:
tpc = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/nfl-favorite-team/team-picking-categories.csv')
tpc.sample(5)
Out[12]:
TEAM BMK UNI CCH STX SMK AFF SLP NYP FRL BNG TRD BWG FUT PLA OWN BEH
10 Houston Texans 87 32 55 71 13 48 52 23 81 35 0 39 61 74 68 95
22 Oakland Raiders 10 97 32 3 90 65 0 0 42 0 65 94 35 19 19 69
4 Buffalo Bills 3 81 58 32 97 97 55 81 55 61 48 74 68 58 77 69
29 New York Jets 71 58 23 23 29 3 29 100 16 10 19 68 52 13 23 16
2 Kansas City Chiefs 26 71 84 84 74 71 97 35 71 48 45 77 94 71 87 44

The team names don't match up with our abbreviated team names in the Elo dataset. Because we are mostly interested in assessing the qualities that contribute to wins, and not necessarily which teams have had the best records, we will use the abbreviated team names to save space in the dataframe.

In [13]:
tpc=tpc.replace({'Kansas City Chiefs':'KC','Pittsburgh Steelers':'PIT','New England Patriots':'NE',
            'Buffalo Bills':'BUF','Carolina Panthers':'CAR','Seattle Seahawks':'SEA','Indianapolis Colts':'IND',
            'Arizona Cardinals':'ARI','Baltimore Ravens':'BAL','Houston Texans':'HOU','New Orleans Saints':'NO',
            'Philadelphia Eagles':'PHI','Denver Broncos':'DEN','Detroit Lions':'DET','Minnesota Vikings':'MIN',
            'New York Giants':'NYG','Atlanta Falcons':'ATL','Dallas Cowboys':'DAL','Jacksonville Jaguars':'JAX',
            'Miami Dolphins':'MIA','Cincinnati Bengals':'CIN','Oakland Raiders':'OAK','Tampa Bay Buccaneers':'TB',
            'Los Angeles Rams':'LAR','Chicago Bears':'CHI','Cleveland Browns':'CLE','San Diego Chargers':'LAC',
            'San Francisco 49ers':'SF','New York Jets':'NYJ','Washington Redskins':'WSH','Tennessee Titans':'TEN',
            'Green Bay Packers':'GB'})
tpc.sample(5)
Out[13]:
TEAM BMK UNI CCH STX SMK AFF SLP NYP FRL BNG TRD BWG FUT PLA OWN BEH
7 IND 19 74 68 97 81 77 100 61 94 94 71 23 26 87 48 16
2 KC 26 71 84 84 74 71 97 35 71 48 45 77 94 71 87 44
31 TEN 16 55 10 13 84 32 94 48 19 16 32 71 0 6 13 29
26 CLE 23 39 16 26 77 42 71 74 10 32 77 87 3 10 10 16
10 HOU 87 32 55 71 13 48 52 23 81 35 0 39 61 74 68 95

Although the 'NFL Favorite Teams' dataset has many interesting ratings, we are going to concentrate on the following, which help us answer our questions.

  • Bang for the Buck (BNG) - Wins per fan dollars spent
  • Affordability (AFF) - Price of tickets, parking and concessions
  • Behavior (BEH) - Suspensions by players on the team since 2007, with extra weight to transgressions against women
  • Coaching (CCH) - Strength of on-field leadership

We will drop all other columns from the dataset for clarity.

In [14]:
tpc.drop(columns = ['FRL', 'OWN', 'PLA', 'FUT', 'BWG', 'TRD', 'NYP', 'SLP', 'SMK', 'STX', 'UNI', 'BMK'], inplace=True)
tpc.sample(5)
Out[14]:
TEAM CCH AFF BNG BEH
0 GB 81 94 81 69
19 JAX 61 100 3 29
6 SEA 95 23 100 29
5 CAR 52 84 90 44
30 WSH 3 0 6 3

We will also relabel 'TEAM' name to 'Winning_Team' so we can more easily merge it with our Elo dataset

In [15]:
tpc.rename(columns = {'BNG':'Bang_for_Buck','BEH':'Behavior','AFF':'Affordability','CCH':'Coaching','TEAM':'Winning_Team'},inplace=True)
tpc.sample(5)
Out[15]:
Winning_Team Coaching Affordability Bang_for_Buck Behavior
23 TB 29 68 13 44
18 DAL 42 16 77 29
11 NO 87 87 45 84
28 SF 0 10 65 95
22 OAK 32 65 0 69

We're now ready to merge our two dataframes together.

In [16]:
df = pd.merge(elo, tpc, on = ['Winning_Team'])

Now we can check to make sure our datatypes are consistent.

In [17]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1332 entries, 0 to 1331
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Game_Date      1332 non-null   datetime64[ns]
 1   Home_Team      1332 non-null   object        
 2   Away_Team      1332 non-null   object        
 3   Home_Score     1332 non-null   float64       
 4   Away_Score     1332 non-null   float64       
 5   Winning_Team   1332 non-null   object        
 6   Home_Victory   1332 non-null   int64         
 7   Coaching       1332 non-null   int64         
 8   Affordability  1332 non-null   int64         
 9   Bang_for_Buck  1332 non-null   int64         
 10  Behavior       1332 non-null   int64         
dtypes: datetime64[ns](1), float64(2), int64(5), object(3)
memory usage: 124.9+ KB

Because scores in the NFL can only be integers, we will change the Home_Score and Away_Score to the 'int' dtype to reflect this.

In [18]:
df.Home_Score = df.Home_Score.astype('int64')
In [19]:
df.Away_Score = df.Home_Score.astype('int64')
In [20]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1332 entries, 0 to 1331
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Game_Date      1332 non-null   datetime64[ns]
 1   Home_Team      1332 non-null   object        
 2   Away_Team      1332 non-null   object        
 3   Home_Score     1332 non-null   int64         
 4   Away_Score     1332 non-null   int64         
 5   Winning_Team   1332 non-null   object        
 6   Home_Victory   1332 non-null   int64         
 7   Coaching       1332 non-null   int64         
 8   Affordability  1332 non-null   int64         
 9   Bang_for_Buck  1332 non-null   int64         
 10  Behavior       1332 non-null   int64         
dtypes: datetime64[ns](1), int64(7), object(3)
memory usage: 124.9+ KB
In [21]:
df.sample(5)
Out[21]:
Game_Date Home_Team Away_Team Home_Score Away_Score Winning_Team Home_Victory Coaching Affordability Bang_for_Buck Behavior
590 2013-12-08 ARI LAR 30 30 ARI 1 95 90 68 56
203 2013-11-03 WSH LAC 30 30 WSH 1 3 0 6 3
400 2010-10-17 HOU KC 35 35 HOU 1 55 48 35 95
506 2011-09-25 TB ATL 16 16 TB 1 29 68 13 44
1206 2014-11-09 DET MIA 20 20 DET 1 35 39 58 84

We're almost finished compiling our dataset - the last factor that we're interested in documenting in it's own column is the cumulative number of wins per team. We will use a NamedAgg function to generate a column that shows this information.

In [22]:
wins = elo.groupby('Winning_Team').agg(number_wins = pd.NamedAgg(column='Winning_Team', aggfunc='count'))
wins
Out[22]:
number_wins
Winning_Team
ARI 39
ATL 47
BAL 59
BUF 31
CAR 35
CHI 43
CIN 44
CLE 25
DAL 43
DEN 53
DET 38
GB 62
HOU 41
IND 48
JAX 22
KC 39
LAC 43
LAR 29
MIA 36
MIN 31
NE 70
NO 51
NYG 45
NYJ 39
OAK 27
PHI 42
PIT 53
SEA 57
SF 55
TB 27
TEN 30
WSH 28

We can merge this with our exisitng dataframe to create a complete datset with all the information needed to answer our questions.

In [23]:
df = pd.merge(wins, df, on=['Winning_Team'])
df
Out[23]:
Winning_Team number_wins Game_Date Home_Team Away_Team Home_Score Away_Score Home_Victory Coaching Affordability Bang_for_Buck Behavior
0 ARI 39 2010-09-12 LAR ARI 13 13 0 95 90 68 56
1 ARI 39 2010-09-26 ARI OAK 24 24 1 95 90 68 56
2 ARI 39 2010-10-10 ARI NO 30 30 1 95 90 68 56
3 ARI 39 2010-12-12 ARI DEN 43 43 1 95 90 68 56
4 ARI 39 2010-12-25 ARI DAL 27 27 1 95 90 68 56
... ... ... ... ... ... ... ... ... ... ... ... ...
1327 WSH 28 2013-11-03 WSH LAC 30 30 1 3 0 6 3
1328 WSH 28 2014-09-14 WSH JAX 41 41 1 3 0 6 3
1329 WSH 28 2014-10-19 WSH TEN 19 19 1 3 0 6 3
1330 WSH 28 2014-10-27 DAL WSH 17 17 0 3 0 6 3
1331 WSH 28 2014-12-20 WSH PHI 27 27 1 3 0 6 3

1332 rows × 12 columns

Summary Statistics¶

We do a quick describe function on our dataframe to get a sense of our data. It doesn't tell us much, but we can see from the mean of our 'Home_Victory' column that the home team wins 57.9% of the time, which is important to answering our question about the significance of home field advantage!

In [24]:
df.describe()
Out[24]:
number_wins Home_Score Away_Score Home_Victory Coaching Affordability Bang_for_Buck Behavior
count 1332.000000 1332.000000 1332.000000 1332.000000 1332.000000 1332.000000 1332.000000 1332.000000
mean 44.839339 24.069069 24.069069 0.579580 53.895646 49.379880 55.304805 51.024775
std 11.750653 10.555554 10.555554 0.493812 30.746125 29.840646 29.105165 30.324444
min 22.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 38.000000 17.000000 17.000000 0.000000 26.000000 23.000000 29.000000 29.000000
50% 43.000000 24.000000 24.000000 1.000000 55.000000 48.000000 58.000000 56.000000
75% 53.000000 31.000000 31.000000 1.000000 81.000000 77.000000 81.000000 77.000000
max 70.000000 62.000000 62.000000 1.000000 100.000000 100.000000 100.000000 95.000000

Visualizations¶

We used a pariplot to provide a rough overview of our dataset and get a feel for if there were any correlation between vairables. There were no obvious correlations that we could identify.

In [25]:
plt.figure()
sns.pairplot(df[['Coaching', 'Behavior', 'Bang_for_Buck', 'Affordability','number_wins']])
plt.show()
<Figure size 640x480 with 0 Axes>

We produced initial catplots for each of our advantages of interest to get a feel for how much impact they have.

In [26]:
sns.catplot(data=df, x='Coaching', y='number_wins', kind='bar', palette='RdBu', aspect=30/10)
plt.show()
In [27]:
sns.catplot(data=df, x='Affordability', y='number_wins', kind='bar', palette='RdBu', aspect=30/10)
plt.show()
In [28]:
sns.catplot(data=df, x='Behavior', y='number_wins', kind='bar', palette='RdBu', aspect=30/10)
plt.show()

Tree Map - The Data Overview Visual of Choice¶

We decided to use a treemap, which was capable of providing a holistic overview of our data and show our audience how the 32 teams in the NFL compared by in terms of win records and also in terms of their ratings across our 4 categories of interest.

We created a new dataframe that included only the information we needed for our visual. We dropped unnecessary columns as well as our 'Bang for Buck' attribute. Based on our previous visualizations, we felt Affordability was a better attribute to concentrate on for fan expenditures.

In [29]:
tree_df = df.drop(columns=['Game_Date', 'Home_Team', 'Away_Team', 'Home_Score', 'Away_Score', 'Bang_for_Buck'])

We created a column to document the percentage of home victories to standardize the win rates for each team, because not all teams play the same amount of games every season.

In [30]:
tree_df['%_Home_Wins'] = tree_df.groupby('Winning_Team')['Home_Victory'].transform('sum') * 100 / df['number_wins']

Now that we have that information, we no longer need our binary column documenting whether a team won or lost each game, so we dropped that column to make our dataframe more condusive to the treemap.

In [31]:
tree_df.drop(columns='Home_Victory', inplace=True)
In [32]:
tree_df.shape
Out[32]:
(1332, 6)

At this point, our dataframe has an entry for every game from the 2010-2015 seasons, so we condense the dataframe by dropping dupplicate rows based on the winning team.

In [33]:
tree_df = tree_df.drop_duplicates(subset = 'Winning_Team', keep='first').reset_index(drop=True)
In [34]:
tree_df.shape
Out[34]:
(32, 6)

We renamed our columns so that our attributes are, once again, abbreviations, as these will be easier to read in the sub-boxes of the treemap.

In [35]:
tree_df.rename(columns = {'Winning_Team':'Team','%_Home_Wins':'HW', 'Coaching':'CCH', 'Affordability':'AFF', 'Behavior':'BEH'},inplace=True)
In [36]:
tree_df.columns
Out[36]:
Index(['Team', 'number_wins', 'CCH', 'AFF', 'BEH', 'HW'], dtype='object')

We melted the dataframe so that each of our 4 categories (Home Wins, Coaching, Affordability, and Behavior) and their respective rating values become their own rows under the column headers of 'Category' and 'Rating.'

In [37]:
tree_df = tree_df.melt(id_vars = ['Team', 'number_wins'], var_name='Category', value_name = 'Rating')
In [38]:
tree_df
Out[38]:
Team number_wins Category Rating
0 ARI 39 CCH 95.000000
1 ATL 47 CCH 48.000000
2 BAL 59 CCH 90.000000
3 BUF 31 CCH 58.000000
4 CAR 35 CCH 52.000000
... ... ... ... ...
123 SEA 57 HW 63.157895
124 SF 55 HW 54.545455
125 TB 27 HW 48.148148
126 TEN 30 HW 53.333333
127 WSH 28 HW 50.000000

128 rows × 4 columns

Our dataframe is now ready to convert into a treemap, which gives us an excellent overview of our data set. It scales the box for each team based on their number of wins compared to other teams, and it uses a rating scale to show how well each team does in each category.

In [39]:
# Code adpated from https://plotly.com/python/treemaps/
import plotly.express as px
fig = px.treemap(tree_df, path=['Team', 'Category'], values='number_wins',
                  color='Rating', hover_data=['Team'],
                  color_continuous_scale=['#543005', '#8c510a', '#bf812d', '#d8b365', '#d9f0d3', '#5aae61', '#31a354', '#1b7837', '#00441b'],
                  color_continuous_midpoint=np.average(tree_df['Rating'], weights=tree_df['number_wins']),
                  width = 1000, height = 350)
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

We can see from our treemap, which shows the highest scoring teams from left to right, that the top 9 teams generally have very strong coaching ratings, which infers that it could be an important factor in team success. Interestingly, not all good teams are unaffordable, as the Green Bay Packers show as the 2nd highest scoring team in terms of overall wins. Bad teams tend to have more bad behavior, as we can see from the 8 worst performing teams. Finally, no team appears to have a poor score in terms of Home Field victories, which goes to show that this could very well be a universal advantage for good and bad teams alike.

We played with an alternative version of the treemap, which mapped the data first by category and then by team, but we felt the first version was more interpretable and helped us tell our story more effectively

In [40]:
# Code adpated from https://plotly.com/python/treemaps/
import plotly.express as px

fig = px.treemap(tree_df, path=['Category', 'Team'], values='number_wins',
                  color='Rating', hover_data=['Team'],
                  color_continuous_scale=['#543005', '#8c510a', '#bf812d', '#d8b365', '#d9f0d3', '#5aae61', '#31a354', '#1b7837', '#00441b'],
                  color_continuous_midpoint=np.average(tree_df['Rating'], weights=tree_df['number_wins']))
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

Radar Chart¶

We sorted the data by number of wins in order to see who the best and worst teams in the leage were so we could do a one-to-one comparison of their ratings across attributes.

In [41]:
df.sort_values(by=['number_wins'])
Out[41]:
Winning_Team number_wins Game_Date Home_Team Away_Team Home_Score Away_Score Home_Victory Coaching Affordability Bang_for_Buck Behavior
620 JAX 22 2012-01-01 JAX IND 19 19 1 61 100 3 29
626 JAX 22 2013-12-05 JAX HOU 27 27 1 61 100 3 29
625 JAX 22 2013-12-01 CLE JAX 28 28 0 61 100 3 29
624 JAX 22 2013-11-24 HOU JAX 6 6 0 61 100 3 29
623 JAX 22 2013-11-10 TEN JAX 27 27 0 61 100 3 29
... ... ... ... ... ... ... ... ... ... ... ... ...
812 NE 70 2010-10-24 LAC NE 20 20 0 100 26 84 77
813 NE 70 2010-10-31 NE MIN 28 28 1 100 26 84 77
814 NE 70 2010-11-14 PIT NE 26 26 0 100 26 84 77
808 NE 70 2010-09-12 NE CIN 38 38 1 100 26 84 77
818 NE 70 2010-12-12 CHI NE 7 7 0 100 26 84 77

1332 rows × 12 columns

The radar chart helps us do an effective comparison to highlight differences between the best and worst teams, but it also serves an important purpose in giving our audience insight into our methodology. We felt it was important to include the chart for these reasons.

First we pull the category ratings for the top team (NE).

In [42]:
NE = tree_df.loc[tree_df['Team'] == 'NE'][['Category', 'Rating']]
NE_values = NE['Rating'].to_list()
NE_values
Out[42]:
[100.0, 26.0, 77.0, 62.857142857142854]

Then we pull the category ratings for the worst team (JAX).

In [43]:
JAX = tree_df.loc[tree_df['Team'] == 'JAX'][['Category', 'Rating']]
JAX_values = JAX['Rating'].to_list()
JAX_values
Out[43]:
[61.0, 100.0, 29.0, 63.63636363636363]

Now we can use this data to generate our radar chart.

In [44]:
#https://plotly.com/python/radar-chart/
import plotly.graph_objects as go

categories = ['Coaching','Affordability', 'Behavior', 'Home Wins']

fig = go.Figure()

fig.add_trace(go.Scatterpolar(
      r=NE_values,
      theta=categories,
      fill='toself',
      fillcolor='#00441b',
      line_color='#00441b',
      opacity=0.7,
      name='Patriots'
))
fig.add_trace(go.Scatterpolar(
      r=JAX_values,
      theta=categories,
      fill='toself',
      fillcolor='#543005',
      line_color='#543005',
      opacity=0.6,
      name='Jaguars'
))

fig.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=True,
      range=[0, 100]
    )),
  showlegend=True
)

fig.update_layout(title='Direct Team Comparison')

fig.show()

We can see from this direct comparison that tbe Patriots have very strong coaching and player behavior scores, but they are an unaffordable team for fans. Conversely, the Jaguars, have relatively low coaching and behavioral scores, but they have a very high score in the affordability category. Interestingly, both teams are equal in terms of Home Field victories, which supports our general assertion that all teams (good and bad) seem to benefit from that advantage.